fast-interp: relaxed-SIMD opcode lowering by matthargett · Pull Request #4950 · bytecodealliance/wasm-micro-runtime

matthargett · 2026-05-21T19:42:42Z

Implements the 20 relaxed-SIMD sub-opcodes (0x100..0x113) in the fast-interp
HANDLE_OP(WASM_OP_SIMD_PREFIX) switch and adds a WAMR_BUILD_RELAXED_SIMD cmake
flag (default off, opt-in). Currently those sub-opcodes hit the
"unsupported SIMD opcode" arm at wasm_interp_fast.c:7474. The three ops SIMDe
doesn't ship (relaxed_q15mulr_s plus the two relaxed_dot_i8x16_i7x16* variants)
get hand-built implementations; the rest route through simde/wasm/relaxed-simd.h.

Why we built this: we're replacing WasmEdge with WAMR fast-interp as the wasm
runtime in a pure-interpreter, App-Store-eligible app. The audio DSP path (a
modified version of xmrsplayer) uses f32x4.relaxed_madd to reach the
interpreter-only performance we need. We introduced our simd128 workloads to
reduce opcode-dispatch pressure/overhead in interpreters; without this PR,
fast-interp fails at load on every one of them.

Test coverage — three layers, 174 conformance checks:

* WAMR unit tests in tests/unit/relaxed-simd/ (load + invoke + boundary regressions).
* 32 hand-rolled abuse cases + 76 differential comparisons against wasmtime's
  Config::relaxed_simd_deterministic(true) mode in our benchmark repo. The
  diff-fuzz layer caught a spec-violating impl of
  i32x4.relaxed_dot_i8x16_i7x16_add_s before submission: an
  off-by-i16-truncation that produced lane values outside the spec-allowed set.
  The upstream spec testsuite did not catch it: every existing assertion stays
  within the i16 pair-sum range. Fix is in this PR; corresponding spec-test
  addition at WebAssembly/relaxed-simd#164.
* 69 upstream WebAssembly/relaxed-simd spec-testsuite assertions wired up
  through fast-interp with (either …) membership semantics.

Cross-microarch benchmarks (M4 Lion P / Sawtooth E / A14 Icestorm / A12 Tempest /
S8 Watch SE2) at
https://github.com/rebeckerspecialties/wasm-benchmark/blob/claude/relaxed-simd-diff-fuzz/README.md#cross-runtime-results-across-apple-silicon-e-cores .
ASan, UBSan, and fuzzing were part of my local dev loop to find corner cases.

Companion PR: legacy exception support #4949

…-SIMD

The relaxed-SIMD proposal (finalized as a wasm 2.0 extension) uses the same 0xfd SIMD prefix and reserves sub-opcodes `0x100..0x113` for its 20 new ops: relaxed_swizzle, relaxed_trunc_{f32x4,f64x2}_{s,u}, relaxed_madd / relaxed_nmadd for f32x4 + f64x2, relaxed_laneselect for i8 / i16 / i32 / i64, relaxed_min / relaxed_max for f32x4 + f64x2, relaxed_q15mulr_s, relaxed_dot_i8x16_i7x16_{s,add_s}.

This commit lays the loader-side validation needed to *recognize* these opcodes without changing dispatch / runtime behaviour:

* `WASMSimdEXTOpcode` enum (wasm_opcode.h) extended with the 20 new constants at the spec-assigned values 0x100..0x113. Gated behind `WASM_ENABLE_RELAXED_SIMD != 0` so a build without the cmake flag (added in a follow-up commit) sees no new symbols and the enum's storage is unchanged.
* `wasm_loader_find_block_addr` SIMD-prefix immediate skipper (wasm_loader.c:8273-8363): the inner switch is now on the raw LEB-uint32 sub-opcode instead of the `(uint8)` cast, so relaxed-SIMD sub-opcodes 0x100..0x113 reach their own case labels instead of aliasing into legacy slots 0x00..0x13 and triggering wrong `skip_leb_*` paths. Relaxed-SIMD opcodes carry no immediates beyond the prefix, so the new cases just `break`. They are listed explicitly so a future SIMD-spec assignment in 0x100..0x113 doesn't silently fall through the default branch and mis-skip an immediate. Cast assignment to the outer `opcode` variable removed since it's no longer used by the inner switch (the outer-function switch already matched `WASM_OP_SIMD_PREFIX` and is inside that case).
* `wasm_loader_prepare_bytecode` SIMD-prefix type checker (wasm_loader.c:16186+): extended with type-signature case labels for each relaxed-SIMD opcode. Three signature classes:
  - unary (1 v128 -> 1 v128): the four trunc variants.
  - binary (2 v128 -> 1 v128): swizzle, min/max, q15mulr, dot_i8x16_i7x16_s.
  - ternary (3 v128 -> 1 v128): madd, nmadd, laneselect, dot_i8x16_i7x16_add_s.

  The 3-input ternary shape uses `POP_V128()` + `POP2_AND_PUSH`, mirroring how `SIMD_v128_bitselect` handles its 3-input shape today, so no new stack-tracker macro is needed.
* The trailing `default:` branch in the type checker keeps rejecting unrecognized SIMD sub-opcodes with `"invalid opcode 0xfd %02x."`, which now correctly surfaces the full uint32 value (relaxed-SIMD opcodes 0x100+ are rendered as e.g. `0xfd 100`; the `%02x` width is a minimum, not a truncation).

The runtime executor (the actual case bodies in `HANDLE_OP(WASM_OP_SIMD_PREFIX)` and the IR encoder widening needed to reach them past the existing 1-byte sub-opcode read) is the follow-up commit. Cmake `WAMR_BUILD_RELAXED_SIMD` flag that flips `WASM_ENABLE_RELAXED_SIMD=1` is the third commit.

Built clean against `cd390ea0` with the flag absent: no binary or behavioural change to existing SIMD code.

References:
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/_md/instructions.md
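The aliasing that the `(uint8)` cast caused in the immediate skipper can be reproduced in isolation. The following is a minimal sketch, not WAMR code: the enum constants and the helpers `legacy_subopcode`, `widened_subopcode` and `needs_memarg_skip` are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Illustrative values only: 0x00 is v128.load in the base SIMD space,
 * 0x100 is the first relaxed-SIMD sub-opcode (relaxed_swizzle). */
enum {
    EX_SIMD_V128_LOAD = 0x00,
    EX_SIMD_RELAXED_SWIZZLE = 0x100,
};

/* Old shape: the LEB-decoded sub-opcode is narrowed before the switch,
 * so 0x100 becomes 0x00 and 0x113 becomes 0x13. */
static uint32_t
legacy_subopcode(uint32_t decoded)
{
    return (uint8_t)decoded;
}

/* New shape: the switch sees the full 32-bit value. */
static uint32_t
widened_subopcode(uint32_t decoded)
{
    return decoded;
}

/* Hypothetical classifier: returns 1 when the sub-opcode is followed by
 * a memarg immediate that the skipper has to step over. */
static int
needs_memarg_skip(uint32_t subop)
{
    switch (subop) {
        case EX_SIMD_V128_LOAD:
            return 1; /* v128.load carries align + offset */
        case EX_SIMD_RELAXED_SWIZZLE:
            return 0; /* relaxed ops carry no immediates */
        default:
            return 0;
    }
}
```

With the narrowed value, `relaxed_swizzle` is classified as if it were `v128.load` and the skipper would consume bytes that belong to the next instruction; with the widened value it reaches its own case.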

The 20 relaxed-SIMD ops (`0x100..0x113`) land as new case bodies inside the existing `HANDLE_OP(WASM_OP_SIMD_PREFIX)` switch in `wasm_interp_fast.c`. Each case follows the legacy SIMD-case shape: pop the v128 operand(s) from `frame_lp`, hand them to a SIMDe intrinsic (or a hand lane loop for the three SIMDe-missing ops), write one v128 result.

To reach a case past 0xff the SIMD sub-opcode is widened from a single byte to a little-endian uint16 in the IR. The loader emits two consecutive bytes via `wasm_loader_emit_int16` (STORE_U16, no padding even on platforms without unaligned access). The runtime reads `frame_ip[0] | (frame_ip[1] << 8)` and switches over the full `0x000..0x113` range. The widening is conditional on `WASM_ENABLE_RELAXED_SIMD != 0`; when off, the IR is still 1-byte-per-SIMD-op via `emit_byte` and the runtime dispatch is the legacy `GET_OPCODE()` 1-byte read, byte-identical to upstream.

Per-case dispatch:

    swizzle (i8x16.relaxed_swizzle)                       DOUBLE
    trunc_{f32x4,f64x2}_{s,u,_zero} (4 unary)             SINGLE
    {f32,f64}x_relaxed_{madd,nmadd} (4 ternary)           TRIPLE
    {i8,i16,i32,i64}x_relaxed_laneselect (4 ternary)      TRIPLE
    {f32,f64}x_relaxed_{min,max} (4 binary)               DOUBLE
    i16x8.relaxed_q15mulr_s (binary)                      hand loop
    i16x8.relaxed_dot_i8x16_i7x16_s (binary)              hand loop
    i32x4.relaxed_dot_i8x16_i7x16_add_s (ternary)         hand loop

SIMDe's `simde/wasm/relaxed-simd.h` (already shipped in `core/deps/simde`) provides 17 of the 20 intrinsics; q15mulr_s, dot_i8x16_i7x16_s, and dot_i8x16_i7x16_add_s are missing so the dispatch loop inlines a per-lane C implementation. The relaxed-SIMD spec allows implementation-defined behavior on overflow for those three (wrap vs. saturate); the impls here match the strict-IEEE / saturating shape (same as the corresponding non-relaxed ops), which is conformant and matches the SIMDe hand-coded fallbacks for q15mulr_sat_s.

A new local `SIMD_TRIPLE_OP(simde_func)` macro pops 3 v128s and hands them to a 3-arg intrinsic; same shape as `SIMD_DOUBLE_OP` / `SIMD_SINGLE_OP` for two- and one-arg ops. `#undef`-ed at the end of the gated block so the macro doesn't leak into the legacy build.

Smoke tested via a 6-op WAT module (swizzle, madd, min, laneselect, q15mulr_s, trunc_f32x4_s) compiled to wasm and run through the `iwasm` CLI with `WAMR_BUILD_RELAXED_SIMD=1`:

    madd       = [110, 240, 390, 560]            ✓
    trunc_f32  = [1, -2, 3, -4]                  ✓
    min        = [1, 2, 2, 1]                    ✓
    q15mulr    = [0,0,1,1,3,4,6,-7]              ✓
    swizzle    = [15..0] (reverse)               ✓
    laneselect = (bitwise a/b mux per mask)      ✓

The `wasm_loader_prepare_bytecode` SIMD switch type checker (commit 1) is already populated for the new opcodes, so the relaxed-SIMD wasm validates through the loader and then reaches the new dispatch cases here. The cmake flag that exposes the feature (`WAMR_BUILD_RELAXED_SIMD`) is the next commit; this one adds the runtime side gated on the eventual macro.
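The two-byte sub-opcode encoding described in this commit can be illustrated as a round trip. This is a sketch under assumptions: `emit_subopcode_le16` and `read_subopcode_le16` are hypothetical stand-ins for the loader's emit path and the interpreter's read, not the actual `wasm_loader_emit_int16` implementation.

```c
#include <stdint.h>

/* Hypothetical loader side: append the sub-opcode as two bytes, low byte
 * first, with no alignment padding. Returns the number of bytes written. */
static unsigned
emit_subopcode_le16(uint8_t *ir, uint16_t subop)
{
    ir[0] = (uint8_t)(subop & 0xff);
    ir[1] = (uint8_t)(subop >> 8);
    return 2;
}

/* Hypothetical runtime side: byte-wise read, so it does not depend on the
 * target supporting unaligned 16-bit loads. */
static uint16_t
read_subopcode_le16(const uint8_t *frame_ip)
{
    return (uint16_t)(frame_ip[0] | (frame_ip[1] << 8));
}
```

Reading byte by byte costs one extra load and an OR compared with a single 16-bit load, but it behaves the same on every target regardless of the instruction pointer's alignment.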

Lights up the dormant `WASM_FEATURE_RELAXED_SIMD` bit at `aot_runtime.h:32` for the fast interpreter. Default `0` so a build that doesn't explicitly opt in stays byte-identical to upstream: the loader + dispatch added in the two prior commits all sit behind `#if WASM_ENABLE_RELAXED_SIMD != 0`.

* `WAMR_BUILD_RELAXED_SIMD=1` adds `-DWASM_ENABLE_RELAXED_SIMD=1` to the C compile line and prints `"Relaxed SIMD enabled"` next to the existing `"SIMD enabled"` line.
* `WAMR_BUILD_RELAXED_SIMD=1 WAMR_BUILD_SIMD=0` fails fast with `FATAL_ERROR "WAMR_BUILD_RELAXED_SIMD=1 requires WAMR_BUILD_SIMD=1"`. Relaxed-SIMD is a superset of the base feature: the dispatch loop, frame_lp v128 cells, and SIMDe intrinsics it shares with legacy SIMD would all be compiled out otherwise.
* Listed in the existing "feature summary" block alongside `"Fixed-width SIMD"` so `WAMR_INFO` output makes the new knob visible.

Verified locally on macOS-15 / aarch64:

    flag=0 (default): iwasm -f madd /tmp/relaxed_smoke.wasm
        -> WASM module load failed: invalid opcode 0xfd 100.
    flag=1:           iwasm -f madd /tmp/relaxed_smoke.wasm
        -> <0x4370000042dc0000 0x440c000043c30000>:v128
           (correct f32x4 result for relaxed_madd)
    flag=1 simd=0:    cmake
        -> "WAMR_BUILD_RELAXED_SIMD=1 requires WAMR_BUILD_SIMD=1"
           (configure aborts)

The two macros `SIMD_V128_TO_SIMDE_V128` and `SIMDE_V128_TO_SIMD_V128` copy 16-byte values between WAMR's `V128` union-of-arrays and SIMDe's compiler-intrinsic vector type (`int32x4_t` on aarch64, `__m128i` on x86-64) at every SIMD case boundary. The previous shape used `bh_memcpy_s`, which lives out-of-line in `core/shared/utils/bh_common.c`. Without LTO the call doesn't inline, so every conversion compiled into a real `bl` instruction: three function calls on 3-operand SIMD ops (madd / nmadd / laneselect / bitselect / dot_add) plus one on the store, for ~4 function calls per SIMD dispatch.

xctrace CPU Counters on the aarch64 M4 E-core, matmul-fma workload (the relaxed-SIMD f32x4_relaxed_madd hot loop):

                    before    after
    Useful          78.1%     71.4%
    Processing       6.1%     23.3%
    Delivery        13.4%      2.9%   <- frontend stalls, the bottleneck
    Discarded        2.4%      2.5%
    total cycles    301M      733M    (over 5s vs 10.9s, more iters)

The 13.4% `Delivery` share (frontend / L1-I stall) dropped to 2.9%: the SIMD-prefix region's case bodies were big enough (~50 instructions per relaxed_madd dispatch, dominated by `bl memcpy_chk` chains and intermediate v128 spills) to push the SIMD switch out of L1-I on the E-core. After the fix each case body is ~15 instructions, all register-resident, no calls.

Per-case disassembly (`f32x4_relaxed_madd`):

    before                              after
    ~50 instructions                    ~15 instructions
    3x bl memcpy_chk                    0 calls
    4x v128 stack-spill load/store      3 frame_lp loads, 1 frame_lp store, 1 fmla.4s

`wasm_interp_call_func_bytecode` total instruction count drops from 14,560 -> 8,735 (40% smaller, comfortably inside the Icestorm 128 KiB L1-I budget alongside hot non-SIMD ops).

End-to-end wallclock on M4 E-core (`cargo run --release --bin bench_relaxed_simd`):

    matmul simd128 (mul+add)
      WAMR before:  1.490 ms median
      WAMR after:   0.468 ms median   (3.2x speedup)
      Pulley:       1.217 ms median
    matmul relaxed-simd (FMA)
      WAMR before:  1.180 ms median
      WAMR after:   0.369 ms median   (3.2x speedup)
      Pulley:       0.921 ms median

WAMR now leads Pulley on both shapes (2.60x faster on matmul-simd128, 2.50x faster on matmul-fma), and WasmEdge interp by 6-7x.

The fix applies to *all* SIMD ops, not just the relaxed-SIMD ones: the macros are on the hot path for every f32x4 / i32x4 / v128.load / v128.store in the fast interpreter.

Correctness: `_Static_assert` upgrades the `bh_assert` size-equality guard from runtime to compile-time so a future divergence between V128 and simde_v128_t trips the build rather than silently miscompiling. Semantically identical to the pre-fix `bh_memcpy_s` for these fixed-size copies.
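The inline-copy idea behind this commit can be sketched without SIMDe. Everything named here is an assumption made for illustration: `ExV128` stands in for WAMR's `V128`, `ex_vec128` stands in for `simde_v128_t` (using the GCC/Clang `vector_size` extension), and the helpers are `static inline` functions rather than the statement-expression macros the PR uses.

```c
#include <stdint.h>

/* Stand-in for WAMR's V128 union-of-arrays (layout assumed). */
typedef union {
    int8_t i8x16[16];
    int32_t i32x4[4];
    float f32x4[4];
} ExV128;

/* Stand-in for simde_v128_t: a 16-byte compiler vector (GCC/Clang only). */
typedef int32_t ex_vec128 __attribute__((vector_size(16)));

/* Compile-time size guard, replacing a runtime assertion. */
_Static_assert(sizeof(ExV128) == sizeof(ex_vec128),
               "ExV128 and ex_vec128 must have the same size");

/* Fixed-size copy through the compiler builtin; at -O1 and above this
 * typically becomes a plain register move or a single vector load,
 * with no out-of-line call. */
static inline ex_vec128
ex_v128_to_vec(ExV128 v)
{
    ex_vec128 out;
    __builtin_memcpy(&out, &v, sizeof(out));
    return out;
}

static inline ExV128
ex_vec_to_v128(ex_vec128 v)
{
    ExV128 out;
    __builtin_memcpy(&out, &v, sizeof(out));
    return out;
}
```

Because the copy size is a compile-time constant and the builtin is visible to the optimizer, the conversion no longer depends on LTO to avoid a function call.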

@lum1n0us

…ts/unit

Anticipates and addresses common WAMR maintainer review feedback on the relaxed-SIMD PR (#3):

* **HIGH (silent AOT mis-compile when RELAXED_SIMD=1 AOT=1)**: the shared loader `prepare_bytecode` (`wasm_loader.c`) is reached by AOT/JIT/wamrc too. With this PR's commit 1 it accepts the new sub-opcodes 0x100..0x113, but the AOT path in `core/iwasm/compilation/aot_compiler.c:1494,2463,2639,2799` does `opcode = (uint8)opcode1`, silently aliasing `relaxed_swizzle` (0x100) into `SIMD_v128_load` (0x00) and reading a garbage memarg at codegen time. Reject the combination at cmake-configure time: `WAMR_BUILD_RELAXED_SIMD=1` now requires `WAMR_BUILD_FAST_INTERP=1` and explicitly rejects `WAMR_BUILD_AOT=1 / WAMR_BUILD_JIT=1 / WAMR_BUILD_FAST_JIT=1 / WAMR_BUILD_WAMR_COMPILER=1` with a diagnostic that points at `aot_compiler.c` and says "build fast-interp-only to use relaxed-SIMD until the AOT/JIT pipelines learn the wider sub-opcode range."
* **`core/config.h` default for `WASM_ENABLE_RELAXED_SIMD`**: `#ifndef … #define … 0 #endif` block alongside `WASM_ENABLE_SIMD` and `WASM_ENABLE_SIMDE`. Cosmetic but matches WAMR's pattern for every other feature flag: non-cmake builds (e.g. CI lint that compiles a TU in isolation) still see a defined value.
* **`tests/unit/relaxed-simd/`**: gtest-based unit test that loads + invokes a hand-encoded wasm module with `f32x4.relaxed_madd`. Two tests:
  - `load_module_with_relaxed_madd`: asserts the loader accepts the module (pre-PR, this fails with `"invalid opcode 0xfd 100"`).
  - `invoke_relaxed_madd_returns_fma_result`: invokes the export, asserts the bit pattern of two f32 lanes (`0x42DC0000` = 110.0 and `0x43700000` = 240.0). Both single-rounded FMA hardware and split mul+add produce the same result here since every input/product/sum is exactly representable in f32.

  Wired into `tests/unit/CMakeLists.txt` next to the parallel `exception-handling` test target. Gated on `WAMR_BUILD_RELAXED_SIMD=1 + WAMR_BUILD_FAST_INTERP=1`.
* **Hand-rolled `q15mulr_s` swap → SIMDe intrinsic**: the patch-2 case body for `SIMD_i16x8_relaxed_q15mulr_s` previously had a lane-by-lane fallback loop (because SIMDe doesn't ship a `relaxed_q15mulr_s` intrinsic). SIMDe DOES ship the non-relaxed `simde_wasm_i16x8_q15mulr_sat` (strict-saturating `sqrdmulh.8h` on aarch64), and the relaxed spec explicitly permits saturating behaviour. Swap to that: smaller code, NEON hardware path, bit-identical to the hand loop on the INT16_MIN² overflow boundary (verified locally via `q15mulr_overflow` test case: both produce 0x7ffe7fff7fff).
* Docs nit: comment in patch-2 `HANDLE_OP(WASM_OP_SIMD_PREFIX)` referenced `emit_uint16(opcode1)` but the actual call is `wasm_loader_emit_int16(opcode1)`. Fixed.

Audit items verified OK without code change:

- `clang-format-14` clean across all 5 commits.
- `-Wpedantic` not enabled in `build-scripts/warnings.cmake` so the `({ })` GCC statement-expression in the V128 conversion macros is fine.
- IR encoding's 2-byte sub-opcode advance via `wasm_loader_emit_int16` is safe on non-unaligned platforms (STORE_U16 with alignment asserts; legacy `emit_byte` also consumed 2 bytes there via padding).
- `WASM_ENABLE_SIMDE` is always set when SIMD+FAST_INTERP are set, so the nested `#include "simde/wasm/relaxed-simd.h"` can't be reached without SIMDe being in scope.
- `AOT_CURRENT_VERSION` correctly not bumped: no AOT struct changed.

References: WAMR PR bytecodealliance#4713 (woodsmc) made tests mandatory in CONTRIBUTING.md; `@lum1n0us`'s PR bytecodealliance#4837 review pattern on fast-interp EH ("follow `tests/unit/interpreter`") shapes the new `tests/unit/relaxed-simd/` layout. CODEOWNERS will route review to `@loganek @lum1n0us @no1wudi @TianlongLiang @yamt`.
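The saturating behaviour that the `q15mulr_s` swap relies on can be written out per lane. This is a scalar sketch of the `sat_s((a*b + 0x4000) >> 15)` formula, with `q15mulr_sat_lane` as a hypothetical name; it is not the SIMDe implementation nor the hand loop from the PR.

```c
#include <stdint.h>

/* One lane of a saturating Q15 rounding multiply.
 * The 32-bit intermediate cannot overflow: the largest product is
 * INT16_MIN * INT16_MIN = 2^30, and adding 0x4000 stays below 2^31. */
static int16_t
q15mulr_sat_lane(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    /* Right-shifting a negative value is implementation-defined in ISO C;
     * this sketch assumes the arithmetic shift that mainstream compilers
     * provide. */
    int32_t rounded = (prod + 0x4000) >> 15;

    if (rounded > INT16_MAX)
        return INT16_MAX; /* only reachable for INT16_MIN * INT16_MIN */
    if (rounded < INT16_MIN)
        return INT16_MIN; /* defensive; not reachable for int16 inputs */
    return (int16_t)rounded;
}
```

For `a = b = INT16_MIN` the unclamped value is 32768, which does not fit in an int16 lane, so this is the single input pair where a saturating and a wrapping implementation disagree.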

…diate

Reviewer note (chatgpt-codex-connector on #3): summing all four i8 byte products directly into the i32 lane skipped the i16 truncation point that the spec defines via i16x8.relaxed_dot + extadd_pairwise_i16x8_s. For lanes with a=b=0x80, the previous impl produced 65536+c, which is outside the spec-allowed result set {-65536+c, 65534+c, -1+c} (wrap or saturate at each of two pair sums).

Fix preserves the i16 intermediate using wrap, matching the i16x8 dot case immediately above.

Worked example, a=b=0x80 in all four lanes:

    lo_pair = (-128*-128) + (-128*-128) = 32768
    (int16)32768 = -32768 (wrap)
    hi_pair = 32768 → -32768
    ext_sum = (i32)-32768 + (i32)-32768 = -65536
    result  = -65536 + c   ✓ wrap+wrap allowed value
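The difference between the fixed and pre-fix shapes can be checked with a per-lane scalar model. This is a sketch under assumptions: `dot_add_lane_wrap` and `dot_add_lane_naive` are hypothetical names, each call models a single i32 output lane fed by four byte pairs, and the narrowing casts rely on the modular conversion that GCC and Clang implement (ISO C leaves out-of-range signed conversion implementation-defined).

```c
#include <stdint.h>

/* Fixed shape: each pair sum is truncated to i16 (wrap) before the two
 * halves are sign-extended and added, then the accumulator c is added
 * with wrapping i32 arithmetic. */
static int32_t
dot_add_lane_wrap(const int8_t a[4], const int8_t b[4], int32_t c)
{
    int32_t lo = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    int32_t hi = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
    int16_t lo16 = (int16_t)(uint16_t)lo; /* i16 truncation point */
    int16_t hi16 = (int16_t)(uint16_t)hi;
    int32_t ext_sum = (int32_t)lo16 + (int32_t)hi16;

    return (int32_t)((uint32_t)ext_sum + (uint32_t)c);
}

/* Pre-fix shape: all four products summed directly in i32, skipping the
 * i16 intermediate. */
static int32_t
dot_add_lane_naive(const int8_t a[4], const int8_t b[4], int32_t c)
{
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * b[i];

    return (int32_t)((uint32_t)sum + (uint32_t)c);
}
```

The two functions agree whenever both pair sums fit in an i16, which is why assertions that stay inside that range cannot tell them apart; they diverge only at inputs such as all bytes equal to 0x80.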

Two new tests for the chatgpt-codex-connector finding on #3:

1. `dot_add_i16_intermediate_overflow_regression`: pins the spec-conformant -65536 result for the input pattern that used to produce 65536 (outside the spec-allowed set {-65536, -1, 65534}). Future refactor back to a direct-i32-sum impl fails immediately.
2. `dot_s_i16_overflow_pin_sibling_op`: pins the sibling `i16x8.relaxed_dot_i8x16_i7x16_s` impl at the same overflow boundary. The current impl correctly truncates via the `(int16)sum` cast (wasm_interp_fast.c:8103); the test makes a future refactor that drops the cast loudly fail.

Both inputs use a = b = 0x80 in all 16 bytes, the classic case where the i8×i8 pair sum overflows i16 and the truncation point between "i16x8 relaxed dot" and "extadd_pairwise_i16x8_s" distinguishes spec-conformant impls from naive direct-sum impls.

Bytecode for both modules was generated via `wat2wasm --enable-relaxed-simd` on minimal known-good WAT (documented inline in the static-array comments) and inlined to avoid a wabt/wat-runtime dependency at test time.

The Coding Guidelines CI check uses `clang-format-14` and flagged the line break I chose in the previous "preserve i16 intermediate" commit. Newer clang-format-22 happens to accept both shapes; clang-format-14 prefers the cast-then-paren-group form:

    result.i32x4[lane] = (int32)((uint32)ext_sum + (uint32)v3.i32x4[lane]);

Functionally identical. No behaviour change.

Two more relaxed-SIMD boundary tests in the unit suite, both exercising implementation-defined behaviors that the dot-product regression tests already established for this PR but that weren't yet covered for these ops:

1. `q15mulr_int16_min_squared_either_sat_or_wrap`: the INT16_MIN * INT16_MIN case. The spec relaxes the result of `sat_s((a*b + 0x4000) >> 15)` so an implementation may pick either the saturated value (0x7fff, what ARM SQRDMULH produces) or the wrapped value (0x8000, what x86 PMULHRSW produces). Test uses *membership* (either of the two allowed values) rather than exact equality, so a future switch to wrap doesn't break the test.
2. `madd_inf_times_zero_propagates_nan`: adversarial input for the fused/unfused FMA path (`f32x4.relaxed_madd`). IEEE 754 §7.2 makes `Inf * 0` an invalid multiply that produces NaN regardless of the subsequent add, so both `fma(Inf, 0, c)` and unfused `Inf * 0 + c` produce *some* NaN, but the specific NaN bit pattern is impl-defined. Test checks each lane against the IEEE-754 NaN predicate (exp == 0xff and fraction != 0) rather than an exact bit pattern.

Locally exercised via `iwasm -f`:

    q15mulr result:               0x7fff (saturate, current SIMDe lowering)
    madd_inf_times_zero result:   0x7fc00000 per lane (canonical f32 NaN)

Both fit the spec-allowed sets the tests describe; the membership assertions confirm without overfitting to the specific bit pattern.
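The two membership-style checks described here can be expressed as small predicates. This is a sketch with hypothetical helper names (`f32_bits`, `is_f32_nan_bits`, `q15mulr_min_squared_allowed`); it is not the gtest code from the PR.

```c
#include <math.h>   /* INFINITY, for callers building adversarial inputs */
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern without aliasing issues. */
static uint32_t
f32_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return bits;
}

/* IEEE-754 binary32 NaN predicate: all-ones exponent and a non-zero
 * fraction. The sign bit and the payload are deliberately ignored. */
static int
is_f32_nan_bits(uint32_t bits)
{
    uint32_t exp = (bits >> 23) & 0xff;
    uint32_t frac = bits & 0x7fffff;
    return exp == 0xff && frac != 0;
}

/* Membership check for the INT16_MIN * INT16_MIN lane of
 * relaxed_q15mulr_s: either the saturated or the wrapped value passes. */
static int
q15mulr_min_squared_allowed(uint16_t lane)
{
    return lane == 0x7fff || lane == 0x8000;
}
```

Checking the predicate instead of one bit pattern keeps the test valid across hosts that return different quiet-NaN payloads or a different sign bit for the invalid multiply.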

matthargett added 9 commits May 18, 2026 23:00

matthargett requested review from TianlongLiang, lum1n0us, no1wudi and yamt as code owners May 21, 2026 19:42

matthargett wants to merge 9 commits into bytecodealliance:main from rebeckerspecialties:feat/relaxed-simd-fast-interp
